Model Driven Telemetry (MDT) diagnosis experiment¶

Prerequisite: Model Driven Telemetry data retrieved from a router, timestamp-aligned and merged into a single file (merged.csv). The data is already filtered and contains only numeric counters.

This notebook performs the following steps:

  1. Load the data.
  2. Pre-process the data.
  3. Visualize the data as a 2D projection using t-distributed stochastic neighbor embedding (t-SNE).
  4. Identify clusters using DBSCAN, together with the transitions between clusters, i.e., the system's change-points.
  5. Distill the key features/counters that best describe a change-point.
  6. Describe the change-point in natural language, explaining what the issue is and what could be done to resolve it.
In [1]:
%load_ext autoreload
%autoreload 2

Load dataset information¶

In [2]:
import modules.dataset as ds
from dotenv import load_dotenv

load_dotenv("env")
ds.extract_dataset('./datasets/mdt-demo.tgz', './output')
In [3]:
import modules.mdt.datasets as mdt_ds
datasets = mdt_ds.Datasets(datasets_dir='./output')
datasets.jupyter_select_dataset_device(select_file=False)

Available Datasets:


mdt-demo



MDT Merged Data¶

See the mdt_data_process notebook for how the merged CSV is curated.

In [4]:
import pandas as pd
import modules.utils as utils
from io import StringIO

merged_data_fn, _ = datasets.get_input_data_file("merged.csv")

df = pd.read_csv(merged_data_fn)  

# show number of rows and columns - dimensionality
shape = df.shape
print("dataset dimensions: rows={}, columns={}".format(shape[0], shape[1]))
# display a sample of the dataset: the first 9 rows and first 9 columns
utils.displayDataFrame(df.iloc[0:9,0:9])
dataset dimensions: rows=1079, columns=7334
ts.V1 n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-good-bytes n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-good-frames n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-multicast-frames n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-bytes n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-frames n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-octet-frames-from1024-to1518 n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-octet-frames-from128-to255 n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-octet-frames-from1519-to-max
1558249381.658611 513408648445952 121428366854.500000 70062.976587 513408648445952 121428366854.500000 16954910447.408417 500207085.414062 55821925436.250000
1558249391.658611 513493882415104 121439268000.000000 70063.975488 513493882415104 121439268000.000000 16954912136.550323 500306645.896484 55831735150.250000
1558249401.658611 513570679283712 121449025040.250000 70064.000000 513570679283712 121449025040.250000 16954914256.728149 500380898.357422 55840574311.500000
1558249411.658611 513647466450944 121458776629.500000 70064.000000 513647466450944 121458776629.500000 16954916430.509275 500454012.693359 55849411445.250000
1558249421.658611 513724222164992 121468531715.750000 70064.911685 513724222164992 121468531715.750000 16954918651.233582 500528780.119141 55858245583.500000
1558249431.658611 513800108363776 121478179924.000000 70065.000000 513800108363776 121478179924.000000 16954920910.709656 500602571.416016 55866977679.500000
1558249441.658611 513876775303168 121487920800.750000 70065.909252 513876775303168 121487920800.750000 16954923308.024719 500675612.757812 55875800239.250000
1558249451.658611 513952962519040 121497604673.250000 70066.908253 513952962519040 121497604673.250000 16954925578.743223 500749016.750000 55884566475.250000
1558249461.658611 514032355700736 121507688129.500000 70067.000000 514032355700736 121507688129.500000 16954927959.780272 500824318.638672 55893704392.500000

MDT Preprocessed Data¶

See the mdt_data_process notebook for how the processed-offline CSV is curated.

Network data collected on routers is multivariate and highly heterogeneous. Some counters are incremental (e.g., packet counts) and some are percentages (e.g., CPU usage), and their ranges vary widely (e.g., byte counts in the trillions versus booleans that can only be one or zero). The received-good-bytes column in the merged data sample above is an example of incremental data in the hundreds of trillions.

To make information from different sources comparable, preprocessing of the selected dataset includes three consecutive steps, each operating over the entire timeseries:

  • Order 1 difference for non-decreasing timeseries
  • Min-max scaling between 0 and 1
  • Exponential smoothing (with parameter 0.5)
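The three steps above can be sketched for a single counter as follows. This is a minimal, dependency-free illustration and not the code used by the mdt_data_process notebook: the function names are hypothetical, the first difference is assumed to be 0, and the smoothing parameter 0.5 is assumed to be the weight given to the newest sample.

```python
from typing import List


def first_order_difference(series: List[float]) -> List[float]:
    """Difference a non-decreasing (cumulative) series; leave other series untouched."""
    non_decreasing = all(b >= a for a, b in zip(series, series[1:]))
    if not non_decreasing or len(series) < 2:
        return list(series)
    # Assumption: the first sample has no predecessor, so its increment is 0.
    return [0.0] + [b - a for a, b in zip(series, series[1:])]


def min_max_scale(series: List[float]) -> List[float]:
    """Rescale values linearly into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(x - lo) / (hi - lo) for x in series]


def exponential_smoothing(series: List[float], alpha: float = 0.5) -> List[float]:
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def preprocess(series: List[float], alpha: float = 0.5) -> List[float]:
    """Apply the three steps in order: difference, scale, smooth."""
    if not series:
        return []
    return exponential_smoothing(min_max_scale(first_order_difference(series)), alpha)


# Toy cumulative byte counter (made-up values, not taken from the dataset).
byte_counter = [100.0, 110.0, 130.0, 130.0, 170.0]
print(preprocess(byte_counter))  # [0.0, 0.125, 0.3125, 0.15625, 0.578125]
```

Note that a gauge which happens to be non-decreasing over the whole window would also be differenced by this sketch; a real pipeline may need counter-type metadata to avoid that.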
In [5]:
preprocessed_data_fn, _ = datasets.get_input_data_file("preprocessed_offline.csv")

df = pd.read_csv(preprocessed_data_fn)

# show number of rows and columns - dimensionality
shape = df.shape
print("dataset dimensions: rows={}, columns={}".format(shape[0], shape[1]))
# display a sample of the dataset: the first 9 rows and first 9 columns
utils.displayDataFrame(df.iloc[0:9,0:9])
dataset dimensions: rows=1079, columns=7334
ts n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-good-bytes n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-good-frames n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-multicast-frames n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-bytes n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-frames n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-octet-frames-from1024-to1518 n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-octet-frames-from128-to255 n0:Cisco-IOS-XR-drivers-media-eth-oper:ethernet-interface_statistics_statistic.csv:HundredGigE0/0/0/0:received-total-octet-frames-from1519-to-max
1558249381.658611 0.681327 0.687531 0.504115 0.681327 0.687531 0.451305 0.858389 0.681585
1558249391.658611 0.681327 0.687531 0.504115 0.681327 0.687531 0.451305 0.858389 0.681585
1558249401.658611 0.644663 0.648323 0.258243 0.644663 0.648323 0.517278 0.741312 0.644928
1558249411.658611 0.626289 0.628532 0.129121 0.626289 0.628532 0.558469 0.677508 0.626523
1558249421.658611 0.616965 0.618757 0.294610 0.616965 0.618757 0.586249 0.653254 0.617207
1558249431.658611 0.608525 0.610206 0.169590 0.608525 0.610206 0.606070 0.636611 0.608696
1558249441.658611 0.607697 0.609107 0.314231 0.607697 0.609107 0.637078 0.624820 0.607856
1558249451.658611 0.605199 0.606604 0.409198 0.605199 0.606604 0.633205 0.620602 0.605310
1558249461.658611 0.617882 0.619045 0.227750 0.617882 0.619045 0.648154 0.627273 0.618074

Changepoint Detector¶

Detect clusters using DBSCAN and the associated transitions of the system between the clusters.
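To illustrate what a "transition between clusters" means, the sketch below takes a sequence of per-timestamp cluster labels and reports each point where the label changes. It is not the ChangepointDetector implementation: the function name and the toy labels are hypothetical, and it assumes that noise points (labelled -1, as scikit-learn's DBSCAN does) are skipped rather than treated as a state of their own.

```python
from typing import List, Sequence, Tuple

NOISE = -1  # label that DBSCAN implementations such as scikit-learn assign to outliers


def find_transitions(
    timestamps: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, int, int]]:
    """Return (timestamp, from_cluster, to_cluster) for every change of cluster."""
    transitions = []
    previous = None
    for ts, label in zip(timestamps, labels):
        if label == NOISE:
            continue  # assumption: outliers do not count as a system state
        if previous is not None and label != previous:
            transitions.append((ts, previous, label))
        previous = label
    return transitions


# Toy example: six samples, one outlier, two changes of cluster.
timestamps = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
labels = [0, 0, -1, 1, 1, 0]
print(find_transitions(timestamps, labels))  # [(30.0, 0, 1), (50.0, 1, 0)]
```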

In [6]:
from modules.mdt.data_utils import load_data, ORIGINAL_DATA
from modules.mdt.changepoint_detector import ChangepointDetector

tstp, dataframe = load_data(preprocessed_data_fn, scale=False, data_selection=ORIGINAL_DATA, ft_regex="^(?!.*(time|second)).*")

detector = ChangepointDetector(dataframe, datasets.get_device())
In [7]:
detector.detect()
detector.plot(withEvents=False)
In [8]:
detector.plot(withEvents=True)
detector.select_changepoints()

Feature Selection¶

The selection problem, i.e., "which of the many changing features best describe the change?", is approached by optimizing an information-theoretic metric: cross-entropy. The goal is to find the subset of features that best describes what is changing at a given timestamp. The intuition is that cross-entropy captures both the amount of additional information in the subset and the divergence of the subset's distribution from the original one. An added regularization term allows the verbosity of the output to be tuned.

More details can be found in T. Feltin, J. A. C. Fuertes, F. Brockners and T. H. Clausen, "Understanding Semantics in Feature Selection for Fault Diagnosis in Network Telemetry Data", NOMS 2023 - 2023 IEEE/IFIP Network Operations and Management Symposium.
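The intuition that cross-entropy combines an information term and a divergence term rests on the general identity H(p, q) = H(p) + KL(p || q). The sketch below checks that identity numerically on two made-up distributions. It is not the objective optimized by the Retriever, and how p and q map onto the subset and original feature distributions is defined in the paper, not here.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Cross-entropy H(p, q) in nats; requires q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two toy distributions over three outcomes (illustrative values only).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

h_pq = cross_entropy(p, q)
# The cross-entropy splits into an information term and a divergence term.
assert math.isclose(h_pq, entropy(p) + kl_divergence(p, q))
print(f"H(p)={entropy(p):.4f}  KL(p||q)={kl_divergence(p, q):.4f}  H(p,q)={h_pq:.4f}")
```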

In [9]:
from modules.mdt.retriever import Retriever
import modules.utils as utils
from IPython.display import clear_output


tstp, dataframe = load_data(merged_data_fn, scale=False, data_selection=ORIGINAL_DATA, ft_regex="^(?!.*(time|second|minute|hour|pid|port)).*",
                            remove_nan=True, remove_inf=True)

selected_changepoints = detector.get_changepoints()
retriever = Retriever(dataframe)
features = retriever.retrieve(selected_changepoints)
Module modules.mdt.explain_lib, development version
Module modules.mdt.selection_lib, development version

Running optimisation...
--------------------------------------------------
Total features considered: 104
Alpha: 2
--------------------------------------------------
Epoch 1 : Score = 17.16
Epoch 2 : Score = 17.16
Epoch 3 : Score = 17.16
In [10]:
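mdt_changepoints = []

# features maps each changepoint timestamp to its selected feature descriptions
for changepoint_ts, selected_features in features.items():
    mdt_changepoints.append({
        # report the event as an offset from the first timestamp
        "Event": f"{changepoint_ts - tstp[0]}",
        "Features": '\n'.join(selected_features),
        "Source": "MDT",
        "Type": "NETWORK_DEVICE"
    })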

clear_output()
utils.displayDictionary(mdt_changepoints)
Event: 4820.0
Features:
  Cisco-IOS-XR-ip-bfd-oper:bfd_counters_packet-counters_packet-counter.csv:bfd-mgmt-pkt-display-type-none:HundredGigE0/0/0/16:0/0/CPU0:hello-receive-count CHANGE: 1.0
  Cisco-IOS-XR-ip-bfd-oper:bfd_session-briefs_session-brief.csv:172.31.14.48:HundredGigE0/0/0/16:0/0/CPU0:0/0/CPU0:ip-single-hop:status-brief-information__async-interval-multiplier__negotiated-local-transmit-interval CHANGE: 1667000.0
  Cisco-IOS-XR-ip-bfd-oper:bfd_session-briefs_session-brief.csv:172.31.14.48:HundredGigE0/0/0/16:0/0/CPU0:0/0/CPU0:ip-single-hop:status-brief-information__async-interval-multiplier__negotiated-remote-transmit-interval CHANGE: 1667000.0
  Cisco-IOS-XR-ip-bfd-oper:bfd_summary.csv:::session-state__down-count CHANGE: 1.0
  Cisco-IOS-XR-ip-bfd-oper:bfd_summary.csv:::session-state__up-count CHANGE: -1.0
Source: MDT
Type: NETWORK_DEVICE

Changepoint / Feature Diagnoser¶

Leverage an LLM to turn the selected set of features along with the amplitude of change into a diagnosis and resolution in natural language.
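As a rough illustration of the idea, the sketch below turns one changepoint record into a plain-text prompt. It is hypothetical: the actual prompt ("MDT Sensor Path Diagnosis") and the Diagnose class are not shown in this notebook, the function names are invented, and the parser assumes each feature line has the form "counter CHANGE: value", as in the output above.

```python
from typing import Dict, List, Tuple


def parse_features(features_text: str) -> List[Tuple[str, float]]:
    """Split 'counter CHANGE: value' lines into (counter, change) pairs."""
    pairs = []
    for line in features_text.splitlines():
        counter, sep, change = line.rpartition(" CHANGE: ")
        if sep:  # skip lines that do not follow the assumed layout
            pairs.append((counter, float(change)))
    return pairs


def build_prompt(changepoint: Dict[str, str]) -> str:
    """Assemble a hypothetical diagnosis prompt from one changepoint record."""
    lines = [
        "A {} reported a change {} seconds after the first sample.".format(
            changepoint["Type"], changepoint["Event"]
        ),
        "Counters that changed, with the amplitude of the change:",
    ]
    for counter, change in parse_features(changepoint["Features"]):
        lines.append("- {}: {:+g}".format(counter, change))
    lines.append("Explain the likely issue and suggest how to resolve it.")
    return "\n".join(lines)


# Record built from two of the features selected above.
example = {
    "Event": "4820.0",
    "Features": (
        "Cisco-IOS-XR-ip-bfd-oper:bfd_summary.csv:::session-state__down-count CHANGE: 1.0\n"
        "Cisco-IOS-XR-ip-bfd-oper:bfd_summary.csv:::session-state__up-count CHANGE: -1.0"
    ),
    "Source": "MDT",
    "Type": "NETWORK_DEVICE",
}
print(build_prompt(example))
```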

In [ ]:
import logging
import os

from modules.diagnose import Diagnose
from modules.llm.azure_ai import AzureLlm
from modules.logger import Logger

logger = Logger(logging.INFO)
# the API key is read from the environment
llm = AzureLlm(logger, os.getenv('AZURE_OPENAI_API_KEY'))

diagnoser = Diagnose(logger, llm)
diagnoser.setOutputInitialDiagnosis("Diagnosis")

diagnoser.run(mdt_changepoints, inject=True)
utils.displayDictionary(mdt_changepoints)
LLM Endpoint: https://traiage-dev-openai-gpt-35.openai.azure.com/
LLM Prompt: MDT Sensor Path Diagnosis